Candidate number¶

  • Group member 1: Iris Wang
  • Group member 2: Yanni Zhang

Table of contents¶

  • I. Important notebook instruction
  • II. Introduction
  • III. Research questions
  • IV. Dataset
    • 1. Scraping the data
    • 2. Describing the data
    • 3. Cleaning the data
      • Airbnb data
      • NTA data
      • Airbnb_NTA data
  • V. Data visualization
    • 1. Visualizing the correlation between demographic variables
    • 2. Visualizing Airbnb Data at NTA and Borough Level
    • 3. Visualizing the relationship between airbnb variables and demographic variables
  • VI. PCA
    • 1. Dimension Reduction for Demographic Data across Different Boroughs
    • 2. Dimension Reduction for Airbnb Data across Different Boroughs
  • VII. Conclusion
  • VIII. Appendix

I. Important notebook instruction¶

In order to run all cells in this project, the necessary modules listed below need to be imported at the beginning. In addition, three extra packages need to be installed:

  1. plotly and folium (pip install folium plotly==5.5.0)
  2. matplotlib-scalebar (pip install matplotlib-scalebar)
  3. mapclassify (pip install -U mapclassify)
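As a small optional sketch (not part of the original notebook), the cell below checks whether the extra packages listed above are importable and prints the matching pip command for any that are missing. The import names are assumed to correspond to the pip packages above.

```python
import importlib.util

# Optional packages used later for interactive maps, scale bars, and
# choropleth classification; pip commands are the ones listed above.
optional = {
    "plotly": "pip install plotly==5.5.0",
    "folium": "pip install folium",
    "matplotlib_scalebar": "pip install matplotlib-scalebar",
    "mapclassify": "pip install -U mapclassify",
}

# find_spec returns None when a module is not installed
missing = [name for name in optional if importlib.util.find_spec(name) is None]
for name in missing:
    print(f"Missing: {name} -- install with `{optional[name]}`")
```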
In [1]:
# Import necessary modules

import geopandas as gpd
import gzip
import numpy as np
import pandas as pd
import requests
import shutil
import tempfile
import urllib.request
import zipfile

from geopandas import GeoDataFrame
from shapely.geometry import Point

from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt
%matplotlib inline

from matplotlib_scalebar.scalebar import ScaleBar
from mpl_toolkits.axes_grid1 import make_axes_locatable

import seaborn as sns
sns.reset_orig()

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

II. Introduction¶

Airbnb has become an increasingly popular accommodation choice for travellers thanks to the growth of the sharing economy. New York, a developed city popular with tourists, hosts a great number of Airbnb listings. The distribution and characteristics of Airbnbs in New York City, and their links with other features, are all interesting issues to investigate. From this perspective, we study two datasets in this project: the Airbnb dataset and the New York City demography dataset.

We start by scraping these two datasets from the websites 'Inside Airbnb' and 'GeoDa Data and Lab' respectively. The 'Inside Airbnb' website periodically publishes snapshots of Airbnb listings around the world. We use the detailed listings data for New York City. The raw dataset contains 37714 observations and 74 variables and was collected between April and November 2021. The NYC demography dataset contains 195 observations and 98 variables. It is collected from the American Community Survey (ACS) by the U.S. Census Bureau and contains demographic information for New York City neighborhoods at the NTA level from 2008 to 2012.

III. Research questions¶

In this project, we aim to explore three main questions:

  1. How are Airbnbs distributed in NYC?
  2. What are the relationships within and between the demographic features and the Airbnb features?
  3. What are the spatial relationships of the demographic features and the Airbnb features with Borough?

To answer the first question, we will use an interactive map and choropleth maps to visualize the distribution of Airbnb features in NYC.

For the second question, we will examine the correlations between demographic variables using heatmaps, and we will examine the correlations between demographic characteristics and Airbnb variables using scatter plots and histograms.

In order to answer the third question, we will apply PCA to the demographic and Airbnb variables to see whether these variables cluster by borough.

IV. Dataset¶

1. Scraping the data¶

In [2]:
# Scraping the Airbnb data by URL

AirbnbURL = "http://data.insideairbnb.com/united-states/ny/new-york-city/2021-11-02/data/listings.csv.gz"
urllib.request.urlretrieve(AirbnbURL, "listing.csv.gz")
with gzip.open('listing.csv.gz', 'rb') as f_in:
    with open('listing.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
airbnb = pd.read_csv('listing.csv',low_memory=False)
airbnb.head(4)
Out[2]:
id listing_url scrape_id last_scraped name description neighborhood_overview picture_url host_id host_url ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2595 https://www.airbnb.com/rooms/2595 20211102175544 2021-11-03 Skylit Midtown Castle Beautiful, spacious skylit studio in the heart... Centrally located in the heart of Manhattan ju... https://a0.muscache.com/pictures/f0813a11-40b2... 2845 https://www.airbnb.com/users/show/2845 ... 4.79 4.86 4.41 NaN f 3 3 0 0 0.33
1 3831 https://www.airbnb.com/rooms/3831 20211102175544 2021-11-03 Whole flr w/private bdrm, bath & kitchen(pls r... Enjoy 500 s.f. top floor in 1899 brownstone, w... Just the right mix of urban center and local n... https://a0.muscache.com/pictures/e49999c2-9fd5... 4869 https://www.airbnb.com/users/show/4869 ... 4.81 4.72 4.65 NaN f 1 1 0 0 4.91
2 5121 https://www.airbnb.com/rooms/5121 20211102175544 2021-11-03 BlissArtsSpace! <b>The space</b><br />HELLO EVERYONE AND THANK... NaN https://a0.muscache.com/pictures/2090980c-b68e... 7356 https://www.airbnb.com/users/show/7356 ... 4.91 4.47 4.52 NaN f 2 0 2 0 0.53
3 5136 https://www.airbnb.com/rooms/5136 20211102175544 2021-11-03 Spacious Brooklyn Duplex, Patio + Garden We welcome you to stay in our lovely 2 br dupl... NaN https://a0.muscache.com/pictures/miso/Hosting-... 7378 https://www.airbnb.com/users/show/7378 ... 5.00 4.50 5.00 NaN f 1 1 0 0 0.02

4 rows × 74 columns

In [3]:
# Scraping the NYC demographic data by url

def get_data():
    url = "https://geodacenter.github.io/data-and-lab/data/nycnhood_acs.zip"
    response = requests.get(url)
    return url, response.content


if __name__ == '__main__':
    url, data = get_data()  

    _tmp_file = tempfile.TemporaryFile()  
    print(_tmp_file)

    _tmp_file.write(data) 
    

    zf = zipfile.ZipFile(_tmp_file, mode='r')
    for names in zf.namelist():
        f = zf.extract(names, './zip')  
        print(f)

    zf.close()
    
NTA = gpd.read_file("zip/NYC_Nhood ACS2008_12.shp")
NTA.head(4)
<_io.BufferedRandom name=62>
zip/NYC_Nhood ACS2008_12.dbf
zip/__MACOSX
zip/__MACOSX/._NYC_Nhood ACS2008_12.dbf
zip/NYC_Nhood ACS2008_12.prj
zip/__MACOSX/._NYC_Nhood ACS2008_12.prj
zip/NYC_Nhood ACS2008_12.shp
zip/__MACOSX/._NYC_Nhood ACS2008_12.shp
zip/NYC_Nhood ACS2008_12.shx
zip/__MACOSX/._NYC_Nhood ACS2008_12.shx
Out[3]:
UEMPRATE cartodb_id borocode withssi withsocial withpubass struggling profession popunemplo poptot ... boroname popdty ntacode medianinco medianagem medianagef medianage HHsize gini geometry
0 0.095785 1 3 652 5067 277 6421 889 2225 48351 ... Brooklyn 497498.701 BK45 1520979 663.3 777.1 722.6 2.96421052631579 0.386315789473684 POLYGON ((-73.91716 40.63173, -73.91722 40.631...
1 0.090011 2 3 2089 7132 1016 10981 1075 2652 61584 ... Brooklyn 589296.926 BK17 1054259 791.4 868.5 827.6 2.46578947368421 0.448089473684211 POLYGON ((-73.91809 40.58657, -73.91813 40.586...
2 0.130393 3 3 3231 8847 2891 21235 712 6483 100130 ... Brooklyn 1506628.84 BK61 980637 863.1 983.9 923.8 2.42925925925926 0.473666666666667 POLYGON ((-73.92165 40.67887, -73.92171 40.678...
3 0.086633 4 3 1103 3508 553 7188 475 1709 33155 ... Brooklyn 468975.876 BK90 519058 333.6 350.1 341.3 2.189 0.44139 POLYGON ((-73.92406 40.71411, -73.92404 40.714...

4 rows × 99 columns

2. Describing the data¶

Visualize New York City Neighborhood Tabulation Areas¶

New York City (NYC) is distinguished for its unique neighborhood division. It is separated into five boroughs: the Bronx, Brooklyn, Manhattan, Queens, and Staten Island. Together, these five boroughs make up this diverse economic and cultural metropolis in the United States.

However, a borough is still quite large when we want to have a better understanding of neighborhoods throughout New York City. Hence, the New York City Department of City Planning combined census data with New York City's fifty-five Public Use Microdata Areas (PUMAs) and created Neighborhood Tabulation Areas (NTAs) for a more detailed division. These medium-sized geographic areas subset the whole city into 195 small blocks, which can better guide the urban policymaking process in New York City.

In [4]:
# Count the number of NTAs in New York City 

NTA['ntaname'].describe()
Out[4]:
count                  195
unique                 195
top       Sunset Park East
freq                     1
Name: ntaname, dtype: object
In [5]:
# Display the list of neighborhood tabulation areas and their belonging boroughs

ntaboroughgeo = NTA[['boroname','ntaname']]
ntaboroughgeo = ntaboroughgeo.rename(columns={'ntaname': 'Neighborhood Tabulation Areas', 'boroname': 'Borough'})
ntaboroughgeo = ntaboroughgeo.sort_values(by = 'Borough', ignore_index=True)
#ntaboroughgeo.style.hide_index()
ntaboroughgeo.head(4).style.hide_index() #We list four NTAs as examples
Out[5]:
Borough Neighborhood Tabulation Areas
Bronx Parkchester
Bronx Van Nest-Morris Park-Westchester Square
Bronx Claremont-Bathgate
Bronx Westchester-Unionport
In [6]:
# Visualize the distribution of neighborhood tabulation areas in each borough

boro_num = ntaboroughgeo.groupby(['Borough']).size().reset_index(name='Counts')
size=boro_num['Counts']

fig = go.Figure(data=[go.Scatter(
    x=boro_num['Borough'], y=boro_num['Counts'],
    mode='markers',
    marker=dict(
        color=['rgb(93, 164, 214)', 'rgb(255, 144, 14)',  'rgb(44, 160, 101)', 'rgb(255, 65, 54)','rgb(31, 17, 17)'],
        size=boro_num['Counts'],
        sizemode='area',
        sizeref=1.*max(size)/(70.**2),
        sizemin=4
    )
)])

fig.update_layout(
    title='The Number of Neighborhood Tabulation Areas in Each Borough',
    xaxis=dict(title='New York City Borough'),
    yaxis=dict(title='Number of NTA')
)

fig.show()

According to the statistics above, there are a total of 195 neighborhood tabulation areas in New York City. The bubble chart effectively compares the number of NTAs among the five boroughs. We can see that Queens has the largest number of neighborhood tabulation areas (58), whereas Staten Island has the smallest (19). Brooklyn, the Bronx, and Manhattan have 51, 38, and 29 NTAs respectively.

In [7]:
# Use interactive map to visualize how NTAs locate in boroughs

NTA.explore(
     column="boroname", 
     tooltip="ntaname", 
     popup=True, 
     tiles="CartoDB positron", 
     cmap="Set1", 
     style_kwds=dict(color="black") 
    )
Out[7]:
Make this Notebook Trusted to load map: File -> Trust Notebook

The interactive map gives us a direct visualization of the neighborhood tabulation areas in the city. Queens still ranks at the top when measuring the land area of each borough. The second largest borough is Brooklyn, followed by Staten Island; the Bronx and Manhattan are the fourth and fifth largest. We will then use these NTA data together with the Airbnb data for further visualization.

3. Cleaning the data¶

Airbnb data¶

After scraping the Airbnb data from the website, we clean the dataset with the following steps:

  1. Drop variables such as the URL links and scrape id, which we think are unrelated to the purposes of this analysis.
  2. Convert the percentage string representation of "host_response_rate" to float.
  3. Convert the currency string "price" to float.
In [8]:
# Drop unrelated columns in airbnb dataset

airbnb = airbnb.drop(['listing_url', 'scrape_id', 'last_scraped','neighborhood_overview', 'picture_url', 'host_url',
       'host_name', 'host_since',
       'host_response_time', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'bathrooms', 'amenities', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'calendar_last_scraped', 'first_review',
       'last_review', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license', 'instant_bookable',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'], axis=1)

# Convert the data types into numeric format
airbnb['host_response_rate'] = airbnb['host_response_rate'].str.rstrip('%').astype('float') / 100.0
airbnb['price'] = airbnb['price'].replace({r'\$': '', ',': ''}, regex=True).astype(float)
airbnb=airbnb.dropna()

NTA data¶

As we mentioned in the last section, this dataset contains demographic information about New York City (NYC) at the NTA (neighborhood) level, so we name this dataframe 'NTA'.

We first investigate the 'ntaname' column, which gives the name of each Neighborhood Tabulation Area, and find that some names refer to parks or airports. Since there should not be any Airbnb housing in these places, we drop the observations whose ntaname contains the word 'Park' or 'Airport'.

As we want to use all demographic features in this dataset, we need to make sure all variables are numeric, so we convert 7 columns from object to float.

In [9]:
# Check the NTA name in NTA dataset

NTA.ntaname.head(4) #List 4 ntanames as examples
Out[9]:
0    Georgetown-Marine Park-Bergen Beach-Mill Basin
1    Sheepshead Bay-Gerritsen Beach-Manhattan Beach
2                               Crown Heights North
3                                 East Williamsburg
Name: ntaname, dtype: object
In [10]:
# Drop the observations whose ntaname contains park or airport

NTA=NTA[~NTA.ntaname.str.contains("Park|Airport")]
In [11]:
# Convert the data type into numeric format

cols = ['popdty', 'medianinco', 'medianagem','medianagef', 'medianage', 'HHsize','gini']
NTA[cols] = NTA[cols].apply(pd.to_numeric, errors='coerce', axis=1)
NTA=NTA.dropna()

Airbnb_NTA data¶

Before implementing the last step of data cleaning, we need to merge these two datasets. Since we are combining two GeoPandas dataframes, we first need to reproject them into the same CRS (Coordinate Reference System). We choose "EPSG:32118" because it refers to New York Long Island and thus ensures that the map is projected correctly.

Since the main Airbnb features we focus on are price and the number of reviews, our data cleaning primarily focuses on dropping the outliers of these two variables. The other main feature we focus on is the number of Airbnb housings, which will be calculated later.

In [12]:
# Change the CRS of both datasets

NTA = NTA.to_crs("EPSG:32118")
geometry = [Point(xy) for xy in zip(airbnb.longitude, airbnb.latitude)] #Convert the airbnb dataframe into a geodataframe
Airbnb_sf = GeoDataFrame(airbnb, crs="EPSG:4326", geometry=geometry)
Airbnb_sf=Airbnb_sf.to_crs("EPSG:32118")
Airbnb_NTA=gpd.sjoin(NTA,Airbnb_sf) #spatial join datasets
In [13]:
# Drop the outliers of "number of reviews" and "price"

cols = ['number_of_reviews', 'price']

Q1 = Airbnb_NTA[cols].quantile(0.25)
Q3 = Airbnb_NTA[cols].quantile(0.75)
IQR = Q3 - Q1

df = Airbnb_NTA[~((Airbnb_NTA[cols] < (Q1 - 1.5 * IQR)) |(Airbnb_NTA[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
In [14]:
# Visualize and compare the dataset before and after dropping the outliers by Plotly

fig = make_subplots(rows=2, cols=2)

fig.add_trace(go.Box(
            name="Price with outliers",
            y=Airbnb_NTA["price"],
        ),
        row=1, col=1
)
fig.add_trace(go.Box(
            name="Number of reviews with outliers",
            y=Airbnb_NTA["number_of_reviews"],
        ),
        row=1, col=2
)
fig.add_trace(go.Box(
            name="Price without outliers",
            y=df["price"],),row=2, col=1)

fig.add_trace(go.Box(
            name="Number of reviews without outliers",
            y=df["number_of_reviews"]), 
        row=2, col=2
)

fig.update_layout(title_text="Compared Original and Cleansed Data in Box Plots")
fig.show()

The box plots above compare "price" and "number of reviews" before and after excluding the outliers. It is evident that the outliers have been clearly reduced in the lower two box plots.

V. Data visualization¶

1. Visualizing the correlation between demographic variables¶

These two heatmaps are both based on the NTA data. This demographic data includes an elaborate list of variables and covers all neighborhood tabulation areas in NYC. The variables can generally be separated into five groups: basic demographic information, social-economic indicators, educational background, working conditions, and ethnic distribution.

The first heatmap visualizes all correlation relationships in the NTA data, whereas the second heatmap selects only 12 important variables for detailed illustration. We mainly select three groups of data: economic (living) conditions, educational background, and ethnic demographics. It is clear that a citizen's economic condition is strongly correlated with their educational level. People who achieved a Master's degree are unlikely to struggle or be poor. Citizens who have a high school diploma or above have a greater chance of doing okay. However, people who did not attend high school are much more likely to be poor or struggling in society. This heatmap also shows some other interesting correlations. For example, the population of Hispanic Americans is positively correlated with poor economic conditions and low educational levels. African Americans show a negative correlation with high educational background. Asian Americans generally have better economic conditions than other ethnic groups.

In sum, the correlation heatmaps make the NTA data more meaningful and exploratory, and point out some specific correlations between different types of variables.

In [15]:
# Plot the correlation heatmap of NTA data

plt.figure(figsize=(25, 25))
heatmap = sns.heatmap(NTA.corr(), vmin=-1, vmax=1)
heatmap.set_title('Demographic Correlation Heatmap for NTA Data', fontdict={'fontsize':25}, pad=30);
In [16]:
# Select certain columns to plot a more detailed heatmap

small_heatmap = NTA[['struggling','poor','okay','onlymaster','onlybachel','onlyhighsc','onlylessth','pacific', 'hispanic','asian','american','african']]

small_heatmap = small_heatmap.rename(columns={'poor': 'Doing poorly', 'struggling': 'Struggling', 'okay': 'Doing okay',
                                             'onlymaster': 'Only Master', 'onlybachel': 'Only Bachelor', 'onlyhighsc': 'Only Highschool', 'onlylessth': 'Less than Highschool',
                                             'pacific': 'Pacific Islander', 'hispanic': 'Hispanic American', 'asian': 'Asian American', 'american': 'American Indian', 'african': 'African American'})
plt.figure(figsize=(10,6))
mask = np.triu(np.ones_like(small_heatmap.corr()))
small_heatmap = sns.heatmap(small_heatmap.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='coolwarm')
small_heatmap.set_title('Triangle Correlation Heatmap for Selected Demographic Data', fontdict={'fontsize':18}, pad=16); 

2. Visualizing Airbnb Data at NTA and Borough Level¶

In this section, we compute the average price, the total number of Airbnb reviews and the average Airbnb score for each of the five boroughs, and count the number of Airbnbs in each borough. The bar charts below depict the distribution of these four variables across the boroughs, while the maps provide a general visualization of the Airbnb data in New York City.

In [17]:
# Group Airbnb_NTA dataset by Borough 

df1 = df[['price','review_scores_rating','geometry','neighbourhood_group_cleansed']]
df1 = df1.dissolve(by='neighbourhood_group_cleansed', aggfunc='mean')
df2 = df.pivot_table(
    ['number_of_reviews'],
    ['neighbourhood_group_cleansed'],
    aggfunc=np.sum)
df2 = pd.merge(df1, df2, on='neighbourhood_group_cleansed')
df3 = df.groupby(['neighbourhood_group_cleansed']).size().reset_index(name='counts')

boroname = pd.merge(df2, df3, on='neighbourhood_group_cleansed')
borough = boroname.rename(columns={'neighbourhood_group_cleansed': 'Borough', 'number_of_reviews': 'Total Number of Reviews', 
                                  'price': 'Average Price', 'review_scores_rating': 'Average Review Score Rating', 'counts': 'Total Number of Airbnb','geometry':'geometry'})

borough = borough[['Borough','Total Number of Airbnb', 'Average Price','Total Number of Reviews', 'Average Review Score Rating']]

#borough=borough.set_index('Borough')
borough.style.hide_index()
Out[17]:
Borough Total Number of Airbnb Average Price Total Number of Reviews Average Review Score Rating
Bronx 194 92.932990 7665 4.697526
Brooklyn 2091 126.954567 88958 4.716475
Manhattan 1641 151.929311 58676 4.696027
Queens 631 100.450079 24681 4.729794
Staten Island 86 105.674419 4639 4.774535
In [26]:
# Display Airbnb data in bar charts

fig = make_subplots(rows=2, cols=2, subplot_titles=('Total Number of Airbnb','Average Price',
                                                    'Total Number of Reviews','Average Review Score Rating'))

fig.add_trace(go.Bar(
    name="Number of Airbnb",
    x=borough['Borough'],
    y=borough['Total Number of Airbnb']),
              row=1, col=1)

fig.add_trace(go.Bar(
            name="Average Price",
            x=borough["Borough"],
            y=borough["Average Price"],
            offsetgroup=0,
        ),
        row=1, col=2
)
fig.add_trace(go.Bar(
            name="Number of Reviews",
            x=borough["Borough"],
            y=borough["Total Number of Reviews"],
            offsetgroup=1,
        ),
        row=2, col=1
)
fig.add_trace(go.Bar(
            name="Average Review Score Rating",
            x=borough["Borough"],
            y=borough["Average Review Score Rating"],
            offsetgroup=2,),
            row=2, col=2)

fig.update_layout(title_text="Airbnb Data Among Different NYC Boroughs")
fig.show()
In [19]:
# Select required columns to generate new dataframe for further analysis
df3 = df[['price','review_scores_rating','geometry','ntaname']]
df3 = df3.dissolve(by='ntaname', aggfunc='mean')
df4 = df.pivot_table(
    ['number_of_reviews'],
    ['ntaname'],
    aggfunc=np.sum)
df4 = pd.merge(df3, df4, on='ntaname')
df5 = df.groupby(['ntaname']).size().reset_index(name='counts')
ntaname = pd.merge(df4, df5, on='ntaname') #groupby airbnb data by Neighborhood Tabulation Area 

df6 = df[['price','review_scores_rating','geometry','neighbourhood_group_cleansed']]
df6 = df6.dissolve(by='neighbourhood_group_cleansed', aggfunc='mean')
df7 = df.pivot_table(
    ['number_of_reviews'],
    ['neighbourhood_group_cleansed'],
    aggfunc=np.sum)
df7 = pd.merge(df6, df7, on='neighbourhood_group_cleansed')
df8 = df.groupby(['neighbourhood_group_cleansed']).size().reset_index(name='counts')
boroname = pd.merge(df7, df8, on='neighbourhood_group_cleansed') #groupby airbnb data by Borough


# Plot Airbnb data by NTA and Borough
fig, [[ax1, ax2],[ax3, ax4],[ax5, ax6],[ax7, ax8]] = plt.subplots(4, 2,figsize=(20, 40))


# Characterized by average Aibnb price
price_nta_plot = ntaname.plot(column="price",legend=True,ax=ax1,cmap='Reds',scheme='quantiles', 
             legend_kwds=dict(loc='upper left', fmt= "{:.0f}", title="Price", 
                              fontsize = 'x-large',frameon=True))
price_nta_plot.add_artist(ScaleBar(1))
price_nta_plot.set_title("Average Airbnb Price by NTA", fontsize=15)

price_boro_plot = boroname.plot(column="price",legend=True,ax=ax2,cmap='Reds',scheme='quantiles', 
             legend_kwds=dict(loc='upper left', fmt= "{:.0f}", title="Price", 
                              fontsize = 'x-large',frameon=True))
price_boro_plot.add_artist(ScaleBar(1))
price_boro_plot.set_title("Average Airbnb Price by Borough", fontsize=15)


# Characterized by total number of Airbnb
count_nta_plot = ntaname.plot(column="counts",legend=True,ax=ax3,cmap='YlOrBr',scheme='quantiles',
             legend_kwds=dict(loc='upper left', fmt= "{:.0f}", title="Number of Airbnb", 
                              fontsize = 'x-large', frameon=True))
count_nta_plot.add_artist(ScaleBar(1))
count_nta_plot.set_title("Number of Airbnb by NTA", fontsize=15)

count_boro_plot = boroname.plot(column="counts",legend=True,ax=ax4,cmap='YlOrBr',scheme='quantiles',
             legend_kwds=dict(loc='upper left', fmt= "{:.0f}", title="Number of Airbnb", 
                              fontsize = 'x-large', frameon=True))
count_boro_plot.add_artist(ScaleBar(1))
count_boro_plot.set_title("Number of Airbnb by Borough", fontsize=15)


# Characterized by total number of reviews of Airbnb
review_nta_plot = ntaname.plot(column="number_of_reviews",legend=True,ax=ax5,cmap='RdPu',scheme='quantiles',
             legend_kwds=dict(loc='upper left', fmt= "{:.0f}", title="Number of Reviews", 
                              fontsize = 'x-large', frameon=True))
review_nta_plot.add_artist(ScaleBar(1))
review_nta_plot.set_title("Number of Reviews by NTA ", fontsize=15)

review_boro_plot = boroname.plot(column="number_of_reviews",legend=True,ax=ax6,cmap='RdPu',scheme='quantiles',
             legend_kwds=dict(loc='upper left', fmt= "{:.0f}", title="Number of Reviews", 
                              fontsize = 'x-large', frameon=True))
review_boro_plot.add_artist(ScaleBar(1))
review_boro_plot.set_title("Number of Reviews by Borough ", fontsize=15)


# Characterized by average score of Airbnb
score_nta_plot = ntaname.plot(column="review_scores_rating",legend=True,ax=ax7,cmap='YlGnBu',scheme='quantiles',
             legend_kwds=dict(loc='upper left', title="Average Scores Rating", 
                              fontsize = 'x-large',frameon=True))
score_nta_plot.add_artist(ScaleBar(1))
score_nta_plot.set_title("Average Scores by NTA ", fontsize=15)

score_boro_plot = boroname.plot(column="review_scores_rating",legend=True,ax=ax8,cmap='YlGnBu',scheme='quantiles',
             legend_kwds=dict(loc='upper left', title="Average Scores Rating", 
                              fontsize = 'x-large',frameon=True))
score_boro_plot.add_artist(ScaleBar(1))
score_boro_plot.set_title("Average Scores by Borough ", fontsize=15)


for ax in (ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8):
    ax.set_axis_off()

  • Average Airbnb Price in NYC

Beyond all doubt, Manhattan, the world's foremost financial and commercial center, is the most expensive borough in New York City. The average Airbnb price in Manhattan exceeds 130 dollars per day, and almost every NTA in Manhattan is marked dark brown (over 140 dollars per day). Interestingly, Brooklyn has the second-highest average Airbnb price. This is partly because Brooklyn is a diverse borough: the areas near Manhattan have higher Airbnb prices, whereas the areas in the southern part have lower prices. Staten Island and Queens rank third and fourth in terms of average Airbnb price. However, there are some exceptional NTAs with higher prices. For example, Todt Hill in Staten Island is famous for its affluent neighborhood. Its high Airbnb price is also partly because it is the highest natural point in NYC, at roughly 400 feet above sea level, which provides breathtaking ocean vistas from homes. Bayside in Queens has also separated itself from other nearby neighborhoods, as this area is well known for its suburban enclaves.

  • Number of Airbnb in NYC

According to these two maps, the distribution of Airbnbs in New York City shows large disparities among the boroughs. Airbnbs in NYC mainly concentrate in the downtown areas. Unsurprisingly, Manhattan and Brooklyn have the highest numbers of Airbnbs; both boroughs have over 1000 listings available. One possible reason for these high numbers is that most Airbnbs are booked by tourists, and these two boroughs contain the major attractions, museums, and theatres in NYC. Staten Island, by contrast, is quite far from the city center and is the only borough not linked by any subway. Thus, it is reasonable to see single-digit Airbnb counts in Staten Island's neighborhood areas. Other areas with fewer Airbnbs are the eastern parts of Queens and the Bronx, which are all far from the city center.

  • Number of Airbnb Reviews in NYC

These two maps are quite similar to the former Number of Airbnb maps, which confirms Manhattan and Brooklyn as the most popular boroughs. Brooklyn receives a total of nearly 90,000 reviews, almost twenty times as many as Staten Island. We can also see that the areas around the city center are the most popular. The western neighborhood areas in Queens and the Bronx also received thousands of reviews. Therefore, it is easy to speculate that downtown is the most preferable area for Airbnb users.

  • Average Airbnb Score in NYC

The average review score map tells a different story. Manhattan and Brooklyn are no longer the leading boroughs in this category; instead, Staten Island and Queens have the best review scores. One reason may be that these two areas have fewer Airbnbs, which gives them less chance to receive low scores. Another reason may be that they are far from the city center, so these areas are less noisy and messy. However, the gaps between the scores are quite small, so it is too early to draw any conclusions.

Overall, these choropleth maps give us a general picture of Airbnbs in New York City in terms of their prices, numbers, and reviews. The visualizations also accord with common sense: for example, the city center is the most popular area for Airbnb and therefore has the highest prices.

3. Visualizing the relationship between airbnb variables and demographic variables¶

In this section, we look at whether the distribution of Airbnbs is related to the demographic data. This is accomplished by examining the correlations of "price", "number of reviews", and "counts" with all demographic characteristics.

As we have 95 demographic features, it is unrealistic to visualize them all. Therefore, we choose to visualize the variables with the highest correlation with "price", "number of reviews" and "counts" respectively. The variables we selected are the population with educational attainment at the professional degree level ("profession") and the number of workers who commute 30 to 44 minutes to work ("comm_30_44"). The price variable has the highest correlation with "profession", whereas both "number of reviews" and "counts" have the highest correlation with "comm_30_44".

In [20]:
# Group NTA dataset by ntaname and combine it with the ntaname dataset, name the new dataframe 'df_nta'
df_nta=pd.merge(NTA,ntaname,on='ntaname')

# Calculate the correlation between variables in df_nta
df_corr=df_nta.corr()
df_corr=df_corr[['price','number_of_reviews','counts']]
df_corr
Out[20]:
price number_of_reviews counts
UEMPRATE -0.358264 0.014199 -0.044651
cartodb_id 0.067850 0.067361 0.083868
borocode -0.191225 -0.258665 -0.319485
withssi -0.125771 0.413628 0.394887
withsocial 0.199046 0.355221 0.417570
... ... ... ...
gini 0.280074 0.362267 0.415898
price 1.000000 0.242029 0.266160
review_scores_rating 0.081558 -0.034113 -0.061326
number_of_reviews 0.242029 1.000000 0.964738
counts 0.266160 0.964738 1.000000

99 rows × 3 columns

In [21]:
# Sort the correlations by "price","number of reviews" and "number of airbnbs" respectively
print("The highest correlation for price \n", df_corr['price'].nlargest(3))
print("The highest correlation for the number of reviews \n", df_corr['number_of_reviews'].nlargest(3))
print("The highest correlation for the number of airbnbs \n", df_corr['counts'].nlargest(3))
The highest correlation for price 
 price         1.000000
male_BA       0.427374
profession    0.425492
Name: price, dtype: float64
The highest correlation for the number of reviews 
 number_of_reviews    1.000000
counts               0.964738
comm_30_44           0.565174
Name: number_of_reviews, dtype: float64
The highest correlation for the number of airbnbs 
 counts               1.000000
number_of_reviews    0.964738
comm_30_44           0.636932
Name: counts, dtype: float64
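Note that `nlargest(3)` includes each target's trivial self-correlation of 1.0 (and, for "counts", its near-duplicate "number_of_reviews"), which is why we read off the third entry above. A small variant (a sketch using a hypothetical stand-in frame with the same layout as `df_corr`) filters the self-correlation out before ranking:

```python
import pandas as pd

# Hypothetical stand-in with the same layout as df_corr:
# rows are all variables, columns are the target variables.
df_corr = pd.DataFrame(
    {"price": [1.00, 0.43, 0.27, 0.24],
     "counts": [0.27, 0.39, 1.00, 0.96]},
    index=["price", "profession", "counts", "number_of_reviews"])

# Drop each target's trivial self-correlation before ranking.
for col in df_corr.columns:
    top = df_corr[col].drop(index=col).nlargest(2)
    print(col, "->", list(top.index))
```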
In [22]:
# Convert the variables into log scale
df_nta['lnprice']=np.log(df_nta['price'])
df_nta['lnreviews']=np.log(df_nta['number_of_reviews'])
df_nta['lnprofession']=np.log(df_nta['profession'])
df_nta['lncomm_30_44']=np.log(df_nta['comm_30_44'])
df_nta['lncounts']=np.log(df_nta['counts'])
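One caveat with the cell above: `np.log` is undefined at zero, so an NTA with zero listings or zero reviews would produce `-inf`. A defensive variant (a sketch on hypothetical toy data, not our `df_nta`) masks non-positive values, or uses `np.log1p` for count-like columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame illustrating the zero-value pitfall.
df_demo = pd.DataFrame({"price": [100.0, 250.0, 0.0],
                        "counts": [12, 0, 7]})

# Mask non-positive prices (they become NaN instead of -inf) ...
df_demo["lnprice"] = np.log(df_demo["price"].where(df_demo["price"] > 0))
# ... and use log(1 + x), which maps 0 to 0, for count variables.
df_demo["log1p_counts"] = np.log1p(df_demo["counts"])

print(df_demo)
```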

We use jointplot to display both the histograms and the scatter diagram. For better visualization, we transform all variables to a log scale. All three figures clearly show strong positive associations. Regarding the histograms, the price variable looks approximately normally distributed, while the "profession" and "comm_30_44" variables have very few extreme values. The distribution of the number of Airbnbs, by contrast, is relatively even across all intervals.

In [27]:
# Visualize the relationships with the highest correlation values using jointplot

f1=sns.jointplot(x=df_nta['lnprice'],y=df_nta['lnprofession'],color='skyblue')
f2=sns.jointplot(x=df_nta['lnreviews'],y=df_nta['lncomm_30_44'],color='gold')
f3=sns.jointplot(x=df_nta['lncounts'],y=df_nta['lncomm_30_44'],color='lightgreen')

f1.fig.suptitle("Price vs Education attainment with a professional degree")
f2.fig.suptitle("Number of reviews vs Commute to work between 30 to 44 min")
f3.fig.suptitle("Number of airbnbs vs Commute to work between 30 to 44 min")

f1.fig.subplots_adjust(top=0.95)
f2.fig.subplots_adjust(top=0.95)
f3.fig.subplots_adjust(top=0.95)

VI. PCA¶

The purpose of PCA is to reduce the dimensionality of a data set with numerous variables by transforming them into a lower-dimensional space while preserving as much of the information in the original data as possible, so that the data can be inspected visually.
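As a minimal illustration of this idea (on a small synthetic data set, not our NTA data): four features generated from two latent factors can be compressed to two components with almost no loss, as `explained_variance_ratio_` confirms.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic example: 4 correlated features built from 2 latent factors,
# so 2 principal components should capture nearly all of the variance.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                            # (200, 2)
print(pca.explained_variance_ratio_.sum())   # close to 1: little information lost
```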

1. Dimension Reduction for Demographic Data across Different Boroughs¶

In the first part of this section, we explore the relationship between demographic features and boroughs. In other words, we investigate whether points from the same borough cluster well in the scatter diagram. If so, it suggests that demographic features are similar within the same borough.

The result shows that the Bronx, Queens and Staten Island indeed form fairly tight clusters, whereas the other two boroughs do not. However, no borough clusters entirely separately from the others, implying that demographic features also overlap substantially across boroughs.
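This visual impression could also be quantified. One option (not part of the original analysis, sketched here on synthetic points rather than our PCA output) is scikit-learn's silhouette score over the borough labels: values near 1 indicate tight, well-separated groups, values near 0 indicate heavy overlap.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two synthetic "boroughs" in a 2-D plane: one pair well separated,
# one pair overlapping, mimicking what we see in the PCA scatter.
separated = np.vstack([rng.normal(0, 0.3, (50, 2)),
                       rng.normal(5, 0.3, (50, 2))])
overlapping = np.vstack([rng.normal(0, 2.0, (50, 2)),
                         rng.normal(1, 2.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

print(silhouette_score(separated, labels))    # high: distinct clusters
print(silhouette_score(overlapping, labels))  # low: heavy overlap
```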

In [28]:
# Select all demographic features
features=['UEMPRATE','withssi', 'withsocial',
       'withpubass', 'struggling', 'profession', 'popunemplo', 'poptot',
       'popover18', 'popinlabou', 'poororstru', 'poor', 'pacificune',
       'pacificinl', 'pacific', 'otherunemp', 'otherinlab', 'otherethni',
       'onlyprofes', 'onlymaster', 'onlylessth', 'onlyhighsc', 'onlydoctor',
       'onlycolleg', 'onlybachel', 'okay', 'mixedunemp', 'mixedinlab', 'mixed',
       'master', 'maleunempl', 'maleover18', 'male_pro', 'male_mastr',
       'male_lesHS', 'male_HS', 'male_doctr', 'male_collg', 'male_BA',
       'maleinlabo', 'maledrop', 'male16to19', 'male', 'lessthan10',
       'lessthanhi', 'households', 'hispanicun', 'hispanicin', 'hispanic',
       'highschool', 'field_1', 'femaleunem', 'femaleover', 'fem_profes',
       'fem_master', 'fem_lessHS', 'fem_HS', 'fem_doctor', 'fem_colleg',
       'fem_BA', 'femaleinla', 'femaledrop', 'femal16_19', 'female',
       'europeanun', 'europeanin', 'european', 'doctorate', 'comm90plus',
       'comm_less5', 'comm_60_89', 'comm_5_14', 'comm_45_59', 'comm_30_44',
       'comm_15_29', 'college', 'bachelor', 'asianunemp', 'asianinlab',
       'asian', 'americanun', 'americanin', 'american', 'africanune',
       'africaninl', 'african','popdty', 
       'medianinco', 'medianagem', 'medianagef', 'medianage', 'HHsize', 'gini']

# Separate out the numeric features
x = NTA.loc[:, features].values
# Separate out the target (borough name)
y = NTA.loc[:,['boroname']].values
# Standardize the features
x = StandardScaler().fit_transform(x)

# Dimension reduction by PCA (imported at the top of the notebook)
pca = PCA(n_components=2)
Y = pca.fit_transform(x)

principalDf = pd.DataFrame(data=Y, columns=['principal component 1', 'principal component 2'])
df_boroname=NTA[['boroname']].reset_index()
finalDf = pd.concat([principalDf, df_boroname], axis = 1)

# Scatter plot the 2-dimensional PCA demographic features grouped by borough
import plotly.express as px
fig = px.scatter(Y, x=finalDf['principal component 1'], y=finalDf['principal component 2'], 
                 symbol=finalDf['boroname'],color=finalDf['boroname'], size_max=1)

fig.update_layout(title_text="Dimension Reduction Demographic Data across Different Boroughs")
fig.show()

2. Dimension Reduction for Airbnb Data across Different Boroughs¶

This PCA analysis aims to visualize the large Airbnb data frame. However, as the graph shows, the points do not separate clearly: we cannot distinguish the Airbnbs in New York City by these features. Most points cluster together, meaning the listings are broadly similar along these dimensions.

One possible reason is that we dropped the outliers from the Airbnb data, so some values that might have separated these listings were removed. Another possible reason is the similarity of Airbnbs across boroughs: for example, they offer similar numbers of bedrooms, accommodates, etc. The Airbnbs in New York City are therefore generally similar to one another.
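A third diagnostic worth checking, which the notebook does not print, is how much variance the two components actually retain. The sketch below (on synthetic stand-in features, not our Airbnb frame) shows that with many weakly correlated features, two components keep only a small share of the total variance, so overlap in the 2-D scatter does not rule out structure in higher dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 14 nearly uncorrelated stand-in features (matching the number of
# Airbnb attributes we feed to PCA above).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 14))

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
# With no shared structure, two components retain only a small
# fraction of the 14 features' total variance.
print(pca.explained_variance_ratio_.sum())
```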

In [31]:
# Select features of airbnb

PCA_airbnb = df[['price','number_of_reviews','accommodates','bedrooms','beds','minimum_nights','maximum_nights',
                 'maximum_minimum_nights','availability_30', 'availability_60', 'availability_90','availability_365',
                 'number_of_reviews_ltm', 'number_of_reviews_l30d',
                 'neighbourhood_group_cleansed']]

PCA_airbnb = PCA_airbnb.drop_duplicates()
X = PCA_airbnb[['price','number_of_reviews','accommodates','bedrooms','beds','minimum_nights','maximum_nights',
                'maximum_minimum_nights','availability_30', 'availability_60', 'availability_90','availability_365',
                 'number_of_reviews_ltm', 'number_of_reviews_l30d']]

# Standardize the data
X = StandardScaler().fit_transform(X)

# Dimensions reduction by PCA 
pca = PCA(n_components=2)
components = pca.fit_transform(X)

# Scatter plot the 2-dimensional PCA of airbnb features grouped by borough
fig = px.scatter(components, x=0, y=1, color=PCA_airbnb['neighbourhood_group_cleansed'],
                 symbol=PCA_airbnb['neighbourhood_group_cleansed'],size_max=1)
fig.update_layout(title_text="Dimension reduction for Airbnb Data across Different Boroughs")

fig.show()

VII. Conclusion¶

In conclusion, this project presents a general picture of Airbnb in New York City at the neighborhood level. We combine the "new" type of Airbnb data with traditional census bureau data to visualize Airbnb and demographic information in New York City. Because the Airbnb data are generated automatically from our digital lives, every host and guest effectively contributes to our study. Although the data set contains some extreme values and outliers, it is quite up to date and presents a fair picture of New York City's Airbnbs.

After combining the Airbnb data with the NTA demographic data, we visualize Airbnbs geographically and gain a deeper understanding of their distribution across NTAs and boroughs. From the maps and the joint plots of commuting time, it is clear that Airbnbs tend to be located near the city center, and that guests likewise prefer downtown areas for their accommodation.

However, when we reduce the dimensionality of the variables, the points do not separate according to the target groups. In other words, people's preferences in choosing an Airbnb appear to be based on its location rather than on its internal characteristics (such as 'beds' or 'minimum_nights').

Therefore, we can argue that location is the main selecting criteria when people book Airbnbs.

VIII. Appendix¶

The descriptions of all variables can be found at the links below:

  • Airbnb data: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896
  • NTA data: https://geodacenter.github.io/data-and-lab/nyc/

This project was produced through the cooperation of our two group members.